An Efficient Hash-based Association Rule Mining Approach for Document Clustering

Authors

  • NOHA NEGM
  • PASSENT ELKAFRAWY
Abstract

Document clustering is an important research problem in text mining, in which documents are grouped without predefined categories or labels. High dimensionality is a major challenge in document clustering, and some recent algorithms address it by clustering on frequent term sets. This paper proposes a new methodology for document clustering based on association rule mining. Our approach consists of three phases: text preprocessing, association rule mining, and document clustering. An efficient Hash-based Association Rule Mining in Text (HARMT) algorithm is used to overcome the drawbacks of the Apriori algorithm. The generated association rules are used to obtain partitions, and partitions that contain the same documents are grouped together. The resulting clusters are then obtained by grouping the partitions by means of derived keywords. Our approach can efficiently reduce the dimensionality of very large text collections, and can thereby improve the accuracy and speed of the clustering algorithm.

Key-Words: Document Clustering, Knowledge Discovery, Hashing, Association Rule Mining, Text Documents, Text Mining.
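
The abstract does not give the details of HARMT, so the following is only a minimal sketch of the general idea it describes: hash-based pruning of candidate term pairs, rule generation, and grouping of rules that cover the same documents. It is not the authors' algorithm. The function names, the CRC32 bucket hash, the bucket count, and the restriction to term pairs are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict
from itertools import combinations


def bucket_of(pair, n_buckets):
    # Deterministic hash of a sorted term pair (illustrative choice, not from the paper).
    return zlib.crc32("\x00".join(pair).encode("utf-8")) % n_buckets


def frequent_pairs(docs, min_support, n_buckets=16):
    """Return {(a, b): support} for term pairs that occur in >= min_support documents."""
    term_count = defaultdict(int)
    bucket_count = [0] * n_buckets
    # Pass 1: count single terms and hash every co-occurring pair into a bucket.
    for terms in docs:
        for t in terms:
            term_count[t] += 1
        for pair in combinations(sorted(terms), 2):
            bucket_count[bucket_of(pair, n_buckets)] += 1
    # Pass 2: count exactly only the pairs that survive both filters.
    # A bucket total is an upper bound on each pair's support, so no
    # frequent pair is ever discarded by the bucket test.
    pair_count = defaultdict(int)
    for terms in docs:
        kept = sorted(t for t in terms if term_count[t] >= min_support)
        for pair in combinations(kept, 2):
            if bucket_count[bucket_of(pair, n_buckets)] >= min_support:
                pair_count[pair] += 1
    return {p: c for p, c in pair_count.items() if c >= min_support}


def association_rules(docs, pairs, min_confidence):
    """Return [(antecedent, consequent, confidence)] derived from frequent pairs."""
    term_count = defaultdict(int)
    for terms in docs:
        for t in terms:
            term_count[t] += 1
    rules = []
    for (a, b), support in pairs.items():
        for x, y in ((a, b), (b, a)):
            confidence = support / term_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, confidence))
    return rules


def partitions(docs, rules):
    """Group rule terms by the exact set of document indices that contain them."""
    groups = defaultdict(set)
    for x, y, _ in rules:
        ids = frozenset(i for i, terms in enumerate(docs) if x in terms and y in terms)
        groups[ids].update((x, y))
    return dict(groups)


if __name__ == "__main__":
    # Toy corpus: each document is the set of terms left after preprocessing.
    docs = [
        {"hash", "rule", "mining"},
        {"hash", "rule", "text"},
        {"cluster", "document", "text"},
        {"cluster", "document", "mining"},
    ]
    pairs = frequent_pairs(docs, min_support=2)
    rules = association_rules(docs, pairs, min_confidence=0.8)
    for doc_ids, keywords in partitions(docs, rules).items():
        print(sorted(doc_ids), sorted(keywords))
```

In this sketch the bucket counts act purely as a filter that avoids exact counting for pairs that cannot be frequent, which is the usual motivation for hashing over plain Apriori candidate generation. Hash collisions can let extra candidates through, but the exact count in the second pass removes them.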

Similar resources

A new approach based on data envelopment analysis with double frontiers for ranking the discovered rules from data mining

Data envelopment analysis (DEA) is a relatively new data-oriented approach for evaluating the performance of a set of peer entities called decision-making units (DMUs) that convert multiple inputs into multiple outputs. Within a relatively short period, DEA has become a strong quantitative and analytical tool for measuring and evaluating performance. In an article written by Toloo et al. (2009...

Investigate the Performance of Document Clustering Approach Based on Association Rules Mining

The challenges of standard clustering methods and the weaknesses of the Apriori algorithm in frequent-termset clustering motivate our research. Based on association rule mining, an efficient approach for Web Document Clustering (ARWDC) has been devised. An efficient Multi-Tire Hashing Frequent Termsets algorithm (MTHFT) has been used to improve the efficiency of mining association...

Clustering Web Documents based on Efficient Multi-Tire Hashing Algorithm for Mining Frequent Termsets

Document Clustering is one of the main themes in text mining. It refers to the process of grouping documents with similar contents or topics into clusters to improve both availability and reliability of text mining applications. Some of the recent algorithms address the problem of high dimensionality of the text by using frequent termsets for clustering. Although the drawbacks of the Apriori al...

Applying a decision support system for accident analysis by using data mining approach: A case study on one of the Iranian manufactures

Uncertain and stochastic states have always had to be considered in risk management and accident analysis, as in other fields of industrial engineering, and they make it difficult for managers to select corrective actions and control measures. In this research, huge data sets of the accidents of a manufacturing and industrial unit have been st...

An Efficient Association Rule Mining Using the H-BIT Array Hashing Algorithm

Association Rule Mining (ARM) finds interesting relationships among the occurrences of items in a given database. Apriori is the traditional algorithm for learning association rules. However, it suffers from repeated database scans and from the large number of candidate itemsets it generates. Each level of candidate itemsets requires separate memory locations. Hash Based Frequent Itemsets Quadratic Pro...

Publication date: 2012